Please use this identifier to cite or link to this item: https://repositorio.ufpe.br/handle/123456789/38480

Title : A linkage pipeline for place records using multi-view encoders
Author : COUSSEAU, Vinícius de Moraes Rêgo
Keywords : Databases; Entity resolution
Issue date : 14-Aug-2020
Publisher : Universidade Federal de Pernambuco
Citation : COUSSEAU, Vinícius de Moraes Rêgo. A linkage pipeline for place records using multi-view encoders. 2020. Dissertação (Mestrado em Ciência da Computação) – Universidade Federal de Pernambuco, Recife, 2020.
Abstract : Extracting information about Web entities has become commonplace in academia and industry alike. In particular, place data stands out as a rich source of geolocated information and spatial context, serving as a foundation for a range of applications. These entities, however, are inherently noisy and introduce several normalization problems, which must be tackled in order to obtain a clean database. Record linkage, also known as entity resolution, refers to the detection of duplicated data from potentially multiple sources, and is one of the most critical cleaning processes to be conducted on a data set. This work presents a novel record linkage solution for large-scale Web-based place data, composed of three steps: generation of potential duplicate place pairs, place pair deduplication, and clustering of the classification results. The detection of duplicate places is the solution's core, and is a complex and seldom addressed problem in this domain. Hence, the main contribution of this work is a model based on a deep neural network architecture, which uses encoders for different information levels of names, addresses, geographical coordinates, and categories. Each encoder uses distinct structures to generate representation vectors, which are concatenated, compared, and mapped to a feature space that represents duplications and non-duplications. Additionally, this work proposes alternative classification models for real-time usage by means of APIs. The complete solution is analyzed, with the classification model for place pairs being evaluated on two distinct data sets and compared against the state of the art. As a result, the proposed solution is shown to handle large quantities of data in a production environment, and the classification model outperforms the baselines on both data sets, thus constituting a complete and efficient solution for the record linkage problem in the place data domain.
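The three steps named in the abstract (candidate pair generation, pair classification, clustering) can be illustrated with a minimal Python sketch. This is not the dissertation's implementation. The `Place` record, the coordinate-grid blocking in `candidate_pairs`, and the connected-components clustering in `cluster` are all assumptions made for illustration. In particular, `is_duplicate` is a simple stand-in that combines string similarity with a distance threshold; it takes the place of the multi-view neural classifier described above and does not reproduce it.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from itertools import combinations
from math import asin, cos, floor, radians, sin, sqrt


@dataclass(frozen=True)
class Place:
    """Hypothetical place record; the real schema is not given in the abstract."""
    id: str
    name: str
    lat: float
    lon: float


def candidate_pairs(places, cell_deg=0.01):
    """Step 1 (assumed strategy): block places by coordinate grid cell.

    Only places in the same cell are paired, so pairs that straddle a
    cell boundary are missed in this simplified version.
    """
    blocks = {}
    for p in places:
        cell = (floor(p.lat / cell_deg), floor(p.lon / cell_deg))
        blocks.setdefault(cell, []).append(p)
    for block in blocks.values():
        yield from combinations(block, 2)


def haversine_m(a, b):
    """Great-circle distance between two places, in meters."""
    lat1, lon1, lat2, lon2 = map(radians, (a.lat, a.lon, b.lat, b.lon))
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371000 * asin(sqrt(h))


def is_duplicate(a, b, name_threshold=0.8, max_dist_m=200.0):
    """Step 2 (stand-in): a rule-based pair classifier.

    The thresholds are arbitrary illustration values, not tuned results.
    """
    name_sim = SequenceMatcher(None, a.name.lower(), b.name.lower()).ratio()
    return name_sim >= name_threshold and haversine_m(a, b) <= max_dist_m


def cluster(places, duplicate_pairs):
    """Step 3 (assumed strategy): connected components via union-find."""
    parent = {p.id: p.id for p in places}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in duplicate_pairs:
        parent[find(a.id)] = find(b.id)

    groups = {}
    for p in places:
        groups.setdefault(find(p.id), set()).add(p.id)
    return sorted(groups.values(), key=sorted)


def link(places):
    """Run the three steps end to end and return clusters of place ids."""
    dupes = [(a, b) for a, b in candidate_pairs(places) if is_duplicate(a, b)]
    return cluster(places, dupes)
```

Calling `link` on a list of `Place` records returns a list of sets of ids, one set per resolved entity. Swapping `is_duplicate` for a learned model leaves the blocking and clustering steps unchanged, which is the main reason for keeping the three steps separate in this sketch.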
URI : https://repositorio.ufpe.br/handle/123456789/38480
Appears in collections: Dissertações de Mestrado - Ciência da Computação

Files in this item:
File : DISSERTAÇÃO Vinícius de Moraes Rêgo Cousseau.pdf
Size : 3.47 MB
Format : Adobe PDF


This item is protected by original copyright



This item is subject to a Creative Commons license